6 research outputs found

    Knowledge management overview of feature selection problem in high-dimensional financial data: Cooperative co-evolution and Map Reduce perspectives

    The term big data characterizes the massive amounts of data generated by advanced technologies in different domains, using the 4Vs (volume, velocity, variety, and veracity) to indicate the amount of data that can only be processed via computationally intensive analysis, the speed of their creation, the different types of data, and their accuracy. High-dimensional financial data, such as time-series and space-time data, contain a large number of features (variables) while having a small number of samples, and are used to measure various real-time business situations for financial organizations. Such datasets are normally noisy, complex correlations may exist between their features, and many domains, including finance, lack the analytic tools to mine the data for knowledge discovery because of the high dimensionality. Feature selection is an optimization problem: find a minimal subset of relevant features that maximizes the classification accuracy and reduces the computations. Traditional statistical feature selection approaches are not adequate to deal with the curse of dimensionality associated with big data. Cooperative co-evolution, a meta-heuristic algorithm and a divide-and-conquer approach, decomposes high-dimensional problems into smaller sub-problems. Further, MapReduce, a programming model, offers a ready-to-use distributed, scalable, and fault-tolerant infrastructure for parallelizing the developed algorithm. This article presents a knowledge management overview of evolutionary feature selection approaches, state-of-the-art cooperative co-evolution and MapReduce-based feature selection techniques, and future research directions
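The MapReduce idea mentioned in this abstract can be illustrated with a minimal sketch, assuming a toy setting that is not taken from the article: candidate feature subsets are split into partitions, a map step scores each partition and emits its local best, and a reduce step keeps the global best. The `RELEVANT` set, the `fitness` function, and the partition contents are all hypothetical, and the built-in `map` and `functools.reduce` stand in for a real distributed runtime.

```python
from functools import reduce

# Hypothetical toy setup: features are numbered 0..9 and these are "relevant".
RELEVANT = {1, 4, 7}


def fitness(subset):
    """Toy fitness (an assumption): reward relevant features, penalise size."""
    return len(RELEVANT & subset) - 0.1 * len(subset)


def map_partition(candidates):
    """Map step: score every candidate in one partition, emit the local best."""
    scored = ((fitness(c), sorted(c)) for c in candidates)
    return max(scored, key=lambda pair: pair[0])


def reduce_best(a, b):
    """Reduce step: keep the better of two (fitness, subset) pairs."""
    return a if a[0] >= b[0] else b


def best_subset(partitions):
    """Run the map step on every partition, then fold the local winners."""
    return reduce(reduce_best, map(map_partition, partitions))


# Three partitions of candidate subsets, as if held by three workers.
partitions = [
    [{0, 1, 2}, {1, 4}],
    [{4, 7, 9}, {1, 4, 7}],
    [{3, 5}, {0, 1, 4, 7, 8}],
]
score, subset = best_subset(partitions)
print(score, subset)
```

Because each partition is scored independently, the map step is the part a cluster could run in parallel; only the small (fitness, subset) pairs need to travel to the reducer.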

    Infrequent pattern detection for reliable network traffic analysis using robust evolutionary computation

    Anomaly detection is very important in many domains, such as cybersecurity, yet cybersecurity datasets contain many rare anomalies or infrequent patterns. Detection of infrequent patterns is computationally expensive. Cybersecurity datasets consist of many features, mostly irrelevant, resulting in lower classification performance by machine learning algorithms. Hence, a feature selection (FS) approach, i.e., selecting relevant features only, is an essential preprocessing step in cybersecurity data analysis. Despite many FS approaches proposed in the literature, cooperative co-evolution (CC)-based FS approaches can be more suitable for cybersecurity data preprocessing considering the Big Data scenario. Accordingly, in this paper, we have applied our previously proposed CC-based FS with random feature grouping (CCFSRFG) to a benchmark cybersecurity dataset as the preprocessing step. The dataset with original features and the dataset with a reduced number of features were used for infrequent pattern detection. Experimental analysis was performed and evaluated using 10 unsupervised anomaly detection techniques. Accordingly, the proposed infrequent pattern detection is termed Unsupervised Infrequent Pattern Detection (UIPD). Then, we compared the experimental results with and without FS in terms of true positive rate (TPR). Experimental analysis indicates that the highest TPR improvement was achieved by the cluster-based local outlier factor (CBLOF) for backdoor infrequent pattern detection, at 385.91% when using FS. Furthermore, the highest overall infrequent pattern detection TPR was improved by 61.47% for all infrequent patterns using clustering-based multivariate Gaussian outlier score (CMGOS) with FS

    Cooperative co-evolution-based feature selection for big data analytics

    The rapid progress of modern technologies generates a massive amount of high-throughput data, called Big Data, which provides opportunities to find new insights using machine learning (ML) algorithms. Big Data consist of many features (attributes). However, irrelevant features may degrade the classification performance of ML algorithms. Feature selection (FS) is a combinatorial optimisation technique used to select a subset of relevant features that represent the dataset. For example, FS is an effective preprocessing step of anomaly detection techniques in Big Cybersecurity Datasets. Evolutionary algorithms (EAs) are widely used search strategies for feature selection. A variant of EAs, called a cooperative co-evolutionary algorithm (CCEA) or simply cooperative co-evolution (CC), which uses a divide-and-conquer approach, is a good choice for large-scale optimisation problems. The goal of this thesis is to investigate three key research issues related to feature selection in Big Data and anomaly detection using feature selection in Big Cybersecurity Data. The first research problem of this thesis is to investigate and develop a feature selection framework using CCEA. The objective of feature selection is twofold: selecting a suitable subset of features (in other words, reducing the number of features to decrease computations) and improving classification accuracy. These two goals are contradictory, but can be achieved using a single objective function. When only classification accuracy is used as the objective function for FS, EAs such as CCEA achieve higher accuracy, even with a higher number of features. Hence, this thesis proposes a penalty-based wrapper single objective function. This function has been used to evaluate the FS process using CCEA, henceforth called Cooperative Co-Evolutionary Algorithm-Based Feature Selection (CCEAFS). Experimental analysis was performed using six widely used classifiers on six different datasets, with and without FS. 
The experimental results indicate that the proposed objective function is efficient at reducing the number of features in the final feature subset without significantly reducing classification accuracy. Furthermore, the performance results have been compared with four other state-of-the-art techniques. CC decomposes a large and complex problem into several subproblems, optimises each subproblem independently, and combines the subproblem solutions only to build a complete solution of the problem. The existing decomposition solutions have poor performance because of some limitations, such as not considering feature interactions, dealing with only an even number of features, and decomposing the dataset statically. However, for real-world problems without any prior information about how the features in a dataset interact, it is difficult to find a suitable problem decomposition technique for feature selection. Hence, the second research problem of this thesis is to investigate and develop a decomposition method that can decompose Big Datasets dynamically and can increase the probability of grouping interacting features into the same subcomponent. Accordingly, this thesis proposes a random feature grouping (RFG) with three variants. RFG has been used in the CC-based FS process, hence called Cooperative Co-Evolution-Based Feature Selection with Random Feature Grouping (CCFSRFG). Experimental analysis performed using six widely used ML classifiers on seven different datasets, with and without FS, indicates that, in most cases, the proposed CCFSRFG-1 outperforms CCEAFS and CCFSRFG-2, and also does so when using all features. Furthermore, the performance results have been compared with five other state-of-the-art techniques. Anomaly detection from Big Cybersecurity Datasets is very important; however, this is a very challenging and computationally expensive task. 
Feature selection in cybersecurity datasets may improve and quantify the accuracy and scalability of both supervised and unsupervised anomaly detection techniques. The third research problem of this thesis is to investigate and develop an anomaly detection approach using feature selection that can improve the anomaly detection performance, and also reduce the execution time. Accordingly, this thesis proposes an approach called Anomaly Detection Using Feature Selection (ADUFS) to deal with this research problem. Experiments were performed on five different benchmark cybersecurity datasets, with and without feature selection, and the performance of both supervised and unsupervised anomaly detection techniques was investigated using ADUFS. The experimental results indicate that, instead of using the original dataset, a dataset with a reduced number of features yields better performance in terms of true positive rate (TPR) and false positive rate (FPR) than the existing techniques for anomaly detection. In addition, all anomaly detection techniques require less computational time when using datasets with a suitable subset of features rather than entire datasets. Furthermore, the performance results have been compared with six other state-of-the-art techniques
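Two ideas from this abstract, random feature grouping and a penalty-based single objective, can be sketched as follows. This is only an illustration under stated assumptions: the abstract gives neither the exact penalty formula nor the three RFG variants, so the function names, the `weight` parameter, and the linear penalty on the fraction of selected features are all hypothetical.

```python
import random


def random_feature_grouping(n_features, n_groups, rng):
    """Shuffle feature indices and deal them into n_groups subcomponents.

    Group sizes differ by at most one, so an odd number of features is fine.
    Calling this again (e.g. every generation) yields a new decomposition,
    which gives interacting features repeated chances to share a group.
    """
    indices = list(range(n_features))
    rng.shuffle(indices)
    return [sorted(indices[g::n_groups]) for g in range(n_groups)]


def penalised_fitness(accuracy, n_selected, n_features, weight=0.1):
    """Assumed penalty form: accuracy minus weight * fraction of features kept."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return accuracy - weight * (n_selected / n_features)


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
for generation in range(2):
    # Dynamic decomposition: a fresh random grouping per generation.
    print(generation, random_feature_grouping(11, 3, rng))

# A slightly less accurate but much smaller subset scores higher.
print(penalised_fitness(0.90, 5, 50), penalised_fitness(0.91, 40, 50))
```

With a pure accuracy objective the 40-feature subset would win; the penalty term is what lets a single objective trade a little accuracy for a much smaller subset.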

    Access methods for Big Data: current status and future directions

    Heterogeneity, size, timeliness, complexity, and confidentiality problems with Big Data hinder progress at all phases of the pipeline that can create value from data. Data analysis, organization, retrieval, and modeling are the initial challenges for Big Data. Data analysis is a clear bottleneck in many applications, both due to the lack of scalability of the core algorithms and due to the complexity of the data that needs to be analyzed. Beyond this, the presentation of the results and their interpretation by non-technical experts is vital to extracting actionable knowledge. To overcome these challenges, there is a need for novel architectures, techniques, algorithms, and analytics to manage Big Data and to retrieve its value and hidden knowledge. Further, we need to build efficient and optimized access methods for many reasons, such as the velocity of Big Data. In this article, we present a brief overview of the current status of access methods for Big Data and discuss a few promising research directions

    Anomaly detection in cybersecurity datasets via cooperative co-evolution-based feature selection

    Anomaly detection from Big Cybersecurity Datasets is very important; however, this is a very challenging and computationally expensive task. Feature selection (FS) is an approach to remove irrelevant and redundant features and select a subset of features, which can improve the machine learning algorithms’ performance. In fact, FS is an effective preprocessing step of anomaly detection techniques. This article’s main objective is to improve and quantify the accuracy and scalability of both supervised and unsupervised anomaly detection techniques. In this effort, a novel anomaly detection approach using FS, called Anomaly Detection Using Feature Selection (ADUFS), has been introduced. Experimental analysis was performed on five different benchmark cybersecurity datasets with and without feature selection, and the performance of both supervised and unsupervised anomaly detection techniques was investigated. The experimental results indicate that instead of using the original dataset, a dataset with a reduced number of features yields better performance in terms of true positive rate (TPR) and false positive rate (FPR) than the existing techniques for anomaly detection. For example, with FS, a supervised anomaly detection technique, multilayer perceptron, increased the TPR by over 200% and decreased the FPR by about 97% for the KDD99 dataset. Similarly, with FS, an unsupervised anomaly detection technique, local outlier factor, increased the TPR by more than 40% and decreased the FPR by 15% and 36% for Windows 7 and NSL-KDD datasets, respectively. In addition, all anomaly detection techniques require less computational time when using datasets with a suitable subset of features rather than entire datasets. Furthermore, the performance results have been compared with six other state-of-the-art techniques based on a decision tree (J48)
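Percentage figures such as "increased the TPR by over 200%" are presumably relative changes in a rate between the run without FS and the run with FS. The sketch below shows that arithmetic using the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The confusion-matrix counts are made up for illustration and are not results from the article, and the function names are hypothetical.

```python
def true_positive_rate(tp, fn):
    """TPR = TP / (TP + FN); returns 0.0 when there are no actual positives."""
    total = tp + fn
    return tp / total if total else 0.0


def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN); returns 0.0 when there are no actual negatives."""
    total = fp + tn
    return fp / total if total else 0.0


def relative_change_percent(before, after):
    """Relative change from `before` to `after`, as a percentage of `before`."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before * 100.0


# Made-up counts (NOT from the article): same test set, without and with FS.
tpr_all = true_positive_rate(tp=20, fn=80)
tpr_fs = true_positive_rate(tp=65, fn=35)
fpr_all = false_positive_rate(fp=30, tn=970)
fpr_fs = false_positive_rate(fp=3, tn=997)

print("TPR change (%):", round(relative_change_percent(tpr_all, tpr_fs), 2))
print("FPR change (%):", round(relative_change_percent(fpr_all, fpr_fs), 2))
```

A relative TPR gain can exceed 100% whenever the baseline rate is low, which is why improvements of several hundred percent are possible for rare attack classes.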